Fault Tolerance within a Grid Environment
نویسندگان
چکیده
Fault tolerance is an important property in Grid computing as the dependability of individual Grid resources may not be able to be guaranteed; also as resources are used outside of organizational boundaries, it becomes increasingly difficult to guarantee that a resource being used is not malicious in some way. As part of the e-Demand project at the University of Durham we are seeking to develop both an improved fault model for Grid computing and a method for providing fault tolerance for Grid applications that will provide protection against both malicious and erroneous services. We have firstly begun to investigate whether the traditional distributed systems fault model can be readily applied to Grid computing, or whether improvements and alterations need to be made. From our initial investigation, we have concluded that timing, omission and interaction faults may become more prevalent in Grid applications than is the case in traditional distributed systems. From this initial fault model, we have begun to develop an approach for fault tolerance based on the idea of job replication, as anomalous results (either maliciously altered or simply wrong) should be caught at the voting stage. This approach combines a replication-based fault tolerance approach with both dynamic prioritization and dynamic scheduling.
منابع مشابه
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling is an important issue for large scale computational grid because of its unreliable nature of grid resources. Commonly exploited techniques to realize fault tolerance is periodic Checkpointing that periodically ...
متن کاملImproving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
متن کاملBuilding Fault-Tolerant Consistency Protocols for an Adaptive Grid Data-Sharing Service
We address the challenge of sharing large amounts of numerical data within computing grids consisting of clusters federation. We focus on the problem of handling the consistency of replicated data in an environment where the availability of storage resources dynamically changes. We propose a software architecture which decouples consistency management from fault-tolerance management. We illustr...
متن کاملIndependent checkpointing in a heterogeneous grid environment
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support different checkpointing protocols and to address the underlying gridnode checkpointers (e.g. BLCR, LinuxSSI, OpenVZ, etc...
متن کاملTopology-aware Fault-Tolerance in Service-Oriented Grids
A promising means of attaining dependability improvements within service-oriented Grid environments is through the set of methods comprising design-diversity software fault-tolerance. However, applying such methods in a Grid environment requires the resolution of a new problem not faced in traditional systems, namely the common service problem whereby multiple disparate but functionally-equival...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003